The Automatic Thai Sentence Extraction

Authors

  • PRADIT MITTRAPIYANURUK
  • VIRACH SORNLERTLAMVANICH
Abstract

Unlike English, Thai has no explicit sentence marker. By convention, a space is placed at the end of a sentence in Thai writing, but a space does not always indicate a sentence boundary; it is also used for other purposes [Danvivathana 1987]. This paper presents an algorithm that extracts sentences from a paragraph by detecting the true sentence-breaking spaces, applying the statistical part-of-speech (POS) tagging technique to the space classification problem. At each step, the algorithm considers two consecutive strings with a space between them and determines whether that space is a true sentence-breaking space. We divided the ORCHID Thai POS-tagged corpus into 10 portions for cross-validation testing. The evaluation shows that the average accuracies of space classification and break-space detection are 85.26% and 79.82%, respectively, and the average false-break rate is 8.75%. Our approach also shows a significant improvement over the traditional statistical POS tagging technique: the average reduction in POS tagging error rate is as high as 11.3%.
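
The idea of treating a space as a token to be tagged can be illustrated with a small sketch. It is not the paper's model: it assumes a plain tag-bigram model with add-one smoothing, looks only at the last tag of the left string and the first tag of the right string, and uses made-up tag names (`NOUN`, `PART`, `SB`, `NSB`, and so on) and a toy training set rather than ORCHID tags or data.

```python
import math
from collections import Counter

# Hypothetical labels for a space: sentence-break vs. non-sentence-break.
SB, NSB = "SB", "NSB"

# Toy tag sequences (invented for illustration); spaces appear as SB/NSB tags.
TOY_CORPUS = [
    ["NOUN", "VERB", "PART", SB, "NOUN", "VERB"],
    ["NOUN", "VERB", "PART", SB, "PRON", "VERB", "NOUN"],
    ["NOUN", NSB, "NUM", NSB, "CLAS", "VERB"],
    ["NOUN", NSB, "NUM", NSB, "CLAS", "VERB", "PART", SB, "NOUN", "VERB"],
]


def train_bigrams(sequences):
    """Count tag bigrams, the left-context totals, and the tag set."""
    bigrams, context, tagset = Counter(), Counter(), set()
    for seq in sequences:
        tagset.update(seq)
        for prev, cur in zip(seq, seq[1:]):
            bigrams[(prev, cur)] += 1
            context[prev] += 1
    return bigrams, context, tagset


def log_prob(model, prev, cur):
    """Add-one smoothed log P(cur | prev)."""
    bigrams, context, tagset = model
    return math.log((bigrams[(prev, cur)] + 1) / (context[prev] + len(tagset)))


def classify_space(model, left_tag, right_tag):
    """Pick the space label that best links the tags on either side of it."""
    scores = {
        label: log_prob(model, left_tag, label) + log_prob(model, label, right_tag)
        for label in (SB, NSB)
    }
    return max(scores, key=scores.get)


model = train_bigrams(TOY_CORPUS)
print(classify_space(model, "PART", "NOUN"))  # space after a sentence-final particle
print(classify_space(model, "NOUN", "NUM"))   # space between a noun and a numeral
```

A fuller tagger would decode the whole sequence jointly (for example with Viterbi search over word and tag probabilities) instead of scoring each space in isolation; the local score here only shows how the space label competes within the tag sequence.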

Similar resources

A Lexicalized Tree Adjoining Grammar for Thai

This paper describes an alternative formalism for Thai syntax parsing based on a lexicalized tree adjoining grammar (LTAG). We first briefly present some formal background concerning LTAG, which is necessary for an understanding of LTAG and its application to Thai. Specifically, we address several issues regarding difficulties in parsing Thai sentences and how to resolve these issues using LTAG...

Automatic Corpus-based Thai Word Extraction

The Thai language is infamous for its ambiguity. One of its important ambiguities is that there is no explicit word boundary, or in other words, there is no explicit definition of what words are. Traditional methods of defining words, which depend on human judgement, are based on unclear criteria or procedures and have several limitations. This paper describes an automatic statistical method Thai word e...

Issues in Thai Text-to-Speech Synthesis: The NECTEC Approach

This paper presents all the essential issues in developing the text-to-speech synthesis for Thai text analysis, prosody generation and speech synthesis. In the text analysis, problems in Thai text processing can be decomposed into the models of sentence extraction, phrase boundary determination and grapheme-to-phoneme conversion. The syllable duration and F0 contour generation rules are include...

Evaluation Measures Considering Sentence Concatenation For Automatic Summarization By Sentence Or Word Extraction

Automatic summaries of text generated through sentence or word extraction have been evaluated by comparing them with manual summaries generated by humans, using numerical evaluation measures based on precision or accuracy. Although sentence extraction has previously been evaluated based only on the precision of a single sentence, sentence concatenations in the summaries should be evaluated as well...

Publication date: 2000